The Volvo Service Desk Dataset used in this project was first released for the BPI Challenge 2013:

Ward Steeman (2013): BPI Challenge 2013, incidents. 4TU.ResearchData. Dataset. https://doi.org/10.4121/uuid:500573e6-accc-4b0c-9576-aa5468b10cee

Business Process Analytics, aka Process Mining

A lot of event data is recorded by different systems. In order to "mine" (extract, preprocess and analyze) these processes, specific data, so-called event data or event logs, is needed. Each event log must contain three components:

  • a case identifier (which process instance an event belongs to)
  • an activity label (what was done)
  • a timestamp (when it was done)
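Independent of any specific tool, an event log with these three components (case identifier, activity label, timestamp) can be sketched as a list of records. The following is a conceptual Python sketch, not bupaR code; all values are illustrative:

```python
from datetime import datetime

# Minimal event log: each event carries the three required components -
# a case identifier, an activity label, and a timestamp.
event_log = [
    {"case": "ticket-1", "activity": "Accepted",  "timestamp": datetime(2012, 5, 1, 9, 0)},
    {"case": "ticket-1", "activity": "Queued",    "timestamp": datetime(2012, 5, 1, 9, 30)},
    {"case": "ticket-1", "activity": "Completed", "timestamp": datetime(2012, 5, 2, 11, 0)},
    {"case": "ticket-2", "activity": "Accepted",  "timestamp": datetime(2012, 5, 1, 10, 0)},
]

# Group events into cases, ordered by time
cases = {}
for event in sorted(event_log, key=lambda e: e["timestamp"]):
    cases.setdefault(event["case"], []).append(event["activity"])

print(cases)
# {'ticket-1': ['Accepted', 'Queued', 'Completed'], 'ticket-2': ['Accepted']}
```

Once events are grouped this way, every process-mining question reduces to questions about these per-case activity sequences.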

The process analysis workflow consists of three iterative steps:

Event Data Extraction

Event data is extracted from one or more information systems and transformed into event logs.

Preprocessing the Data

Preprocessing is done by aggregation (removing redundant details), filtering (focusing the analysis on relevant subsets) and enrichment (adding useful data attributes, e.g. calculated values).
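Filtering and enrichment can be illustrated with a small tool-independent sketch (conceptual Python, not bupaR code; the attribute names mirror this dataset but the records are made up):

```python
from datetime import datetime

events = [
    {"case": "t1", "activity": "Accepted",  "impact": "Low",
     "timestamp": datetime(2012, 5, 1, 9, 0)},
    {"case": "t1", "activity": "Completed", "impact": "Low",
     "timestamp": datetime(2012, 5, 3, 9, 0)},
    {"case": "t2", "activity": "Accepted",  "impact": "Major",
     "timestamp": datetime(2012, 5, 1, 10, 0)},
]

# Filtering: keep only the highest-severity tickets
filtered = [e for e in events if e["impact"] == "Major"]

# Enrichment: add a derived attribute (weekday of the event)
for e in events:
    e["weekday"] = e["timestamp"].strftime("%A")

print(len(filtered), events[0]["weekday"])  # 1 Tuesday
```

Aggregation works the same way in principle, e.g. collapsing repeated low-level events into one higher-level activity.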

Event Data Analysis

The data is analyzed from three perspectives:

  • The organizational perspective - focus on the actors of the process (e.g. the roles of different doctors and nurses and how they work together)
  • The control-flow perspective - focus on the flow and structuredness of the process (e.g. a patient's journey through the emergency department)
  • The performance perspective - focus on time and efficiency (e.g. how long it takes until a patient can leave the emergency department)

Different perspectives can also be combined with multivariate analysis (e.g. are there links between actors and performance issues) as well as with the inclusion of additional data attributes (e.g. custom activities, costs).

Log Data Overview

First we load the XES log file to get started:

# load the bupaR ecosystem packages used below
library(bupaR)       # core event log objects
library(xesreadR)    # read_xes()
library(edeaR)       # process metrics
library(processmapR) # process maps and charts
library(dplyr)       # group_by, summarize, etc.

data <- read_xes("bpi_challenge_2013_incidents.xes")

Next we want to get an overview about the available information within the event log:

  • How many events are recorded (the total number of activity executions across all tickets)?
  • How many cases does the log contain (incident tickets)?
  • How many traces are represented in the log (distinct activity sequences)?
  • How many distinct activities are performed (the different actions performed during a ticket lifecycle)?
  • What is the average trace length?
  • What is the time period in which the data is recorded (when did all of that happen)?
data %>% summary()
## Number of events:  65533
## Number of cases:  7554
## Number of traces:  1511
## Number of distinct activities:  4
## Average trace length:  8.675271
## 
## Start eventlog:  2010-03-31 14:59:42
## End eventlog:  2012-05-22 23:22:25
##  CASE_concept_name     activity_id       impact         
##  Length:65533       Accepted :40117   Length:65533      
##  Class :character   Completed:13867   Class :character  
##  Mode  :character   Queued   :11544   Mode  :character  
##                     Unmatched:    5                     
##                                                         
##                                                         
##                                                         
##               lifecycle_id    org_group            resource_id   
##  In Progress        :30239   Length:65533       Siebel   : 6162  
##  Awaiting Assignment:11544   Class :character   Krzysztof: 1173  
##  Resolved           : 6115   Mode  :character   Pawel    :  925  
##  Closed             : 5716                      Marcin   :  688  
##  Wait - User        : 4217                      Marika   :  605  
##  Assigned           : 3221                      Michael  :  587  
##  (Other)            : 4481                      (Other)  :55393  
##    org_role         organization country organization involved
##  Length:65533       Length:65533         Length:65533         
##  Class :character   Class :character     Class :character     
##  Mode  :character   Mode  :character     Mode  :character     
##                                                               
##                                                               
##                                                               
##                                                               
##    product          resource country     timestamp                  
##  Length:65533       Length:65533       Min.   :2010-03-31 14:59:42  
##  Class :character   Class :character   1st Qu.:2012-04-27 04:46:48  
##  Mode  :character   Mode  :character   Median :2012-05-02 14:07:37  
##                                        Mean   :2012-04-25 07:41:31  
##                                        3rd Qu.:2012-05-04 09:37:27  
##                                        Max.   :2012-05-22 23:22:25  
##                                                                     
##  activity_instance_id     .order     
##  Length:65533         Min.   :    1  
##  Class :character     1st Qu.:16384  
##  Mode  :character     Median :32767  
##                       Mean   :32767  
##                       3rd Qu.:49150  
##                       Max.   :65533  
## 
data %>% select(lifecycle_id) %>% group_by(lifecycle_id) %>% summarize()
## # A tibble: 13 x 1
##    lifecycle_id         
##    <fct>                
##  1 Assigned             
##  2 Awaiting Assignment  
##  3 Cancelled            
##  4 Closed               
##  5 In Call              
##  6 In Progress          
##  7 Resolved             
##  8 Unmatched            
##  9 Wait                 
## 10 Wait - Customer      
## 11 Wait - Implementation
## 12 Wait - User          
## 13 Wait - Vendor

The EDA shows that:

  • our data contains 7554 incident tickets (process instances)
  • there are 1511 distinct traces (unique activity sequences), which indicates a high variability of activity chains
  • the system records only 4 different main ticket states (activities)
  • the four activities have 13 sub-statuses (lifecycle IDs)
  • from an activity-chain perspective, an incident ticket has an average length of about 9 activities

Since the terminology in bupaR differs from the IEEE XES standard (bupaR follows current process mining literature rather than the standard's terminology), we also check the meta-information to verify that the XES file was mapped correctly to bupaR's terminology. As we can see, the mapping is correct:

data %>% mapping()
## Case identifier:     CASE_concept_name 
## Activity identifier:     activity_id 
## Resource identifier:     resource_id 
## Activity instance identifier:    activity_instance_id 
## Timestamp:           timestamp 
## Lifecycle transition:        lifecycle_id

Activity Analysis from Log

Since activities describe the flow of the process/ticket, we take a closer look at:

  • the actions performed
# Overall activity count
data %>% activities()
## # A tibble: 4 x 3
##   activity_id absolute_frequency relative_frequency
##   <fct>                    <int>              <dbl>
## 1 Accepted                 40117          0.612    
## 2 Completed                13867          0.212    
## 3 Queued                   11544          0.176    
## 4 Unmatched                    5          0.0000763
  • the order activities are performed (activity sequences aka traces)
data %>% traces()
## # A tibble: 1,511 x 3
##    trace                                     absolute_frequen… relative_frequen…
##    <chr>                                                 <int>             <dbl>
##  1 Accepted,Accepted,Completed                            1754            0.232 
##  2 Accepted,Accepted,Completed,Completed                   524            0.0694
##  3 Accepted,Accepted,Queued,Accepted,Comple…               352            0.0466
##  4 Accepted,Accepted,Queued,Accepted,Accept…               334            0.0442
##  5 Queued,Accepted,Completed,Completed                     300            0.0397
##  6 Accepted,Accepted,Accepted,Completed,Com…               230            0.0304
##  7 Accepted,Accepted,Queued,Accepted,Accept…               185            0.0245
##  8 Accepted,Accepted,Queued,Accepted,Queued…               161            0.0213
##  9 Accepted,Accepted,Accepted,Accepted,Comp…               149            0.0197
## 10 Accepted,Accepted,Queued,Accepted,Accept…               122            0.0162
## # … with 1,501 more rows
data %>% trace_explorer(coverage = 0.6)

The EDA shows that:

  • we have just four activities in our log, which makes a more detailed analysis much harder
  • the activity sequences (traces) are not very expressive in terms of process interpretability
  • even though the shown sequences cover 60% of all cases, there are many more activity sequences (traces)
  • it might be necessary to combine the activities with their sub-statuses to gain more insight
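Combining an activity with its lifecycle sub-status amounts to building a refined label per event, turning 4 coarse activities and 13 sub-statuses into more expressive activity names. A conceptual sketch (Python, not bupaR code; the labels come from this log, the records are made up):

```python
# Refine coarse activity labels by appending the lifecycle sub-status.
events = [
    {"activity": "Accepted", "lifecycle": "In Progress"},
    {"activity": "Accepted", "lifecycle": "Wait - User"},
    {"activity": "Queued",   "lifecycle": "Awaiting Assignment"},
]

for e in events:
    e["refined"] = f'{e["activity"]}/{e["lifecycle"]}'

print([e["refined"] for e in events])
# ['Accepted/In Progress', 'Accepted/Wait - User', 'Queued/Awaiting Assignment']
```

In the R workflow this would correspond to mutating a combined activity column before recomputing the traces.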

Event Data Analysis: Organizational Perspective

Processes always depend on resources or actors getting things done. Even in very structured and standardized processes, the habits and decisions of staff members have an impact on the efficiency and effectiveness of the process. Therefore we investigate:

  • Who executes the work and is therefore involved?
resources(data)
## # A tibble: 1,440 x 3
##    resource_id absolute_frequency relative_frequency
##    <fct>                    <int>              <dbl>
##  1 Siebel                    6162            0.0940 
##  2 Krzysztof                 1173            0.0179 
##  3 Pawel                      925            0.0141 
##  4 Marcin                     688            0.0105 
##  5 Marika                     605            0.00923
##  6 Michael                    587            0.00896
##  7 Fredrik                    585            0.00893
##  8 Piotr                      554            0.00845
##  9 Andreas                    542            0.00827
## 10 Brecht                     477            0.00728
## # … with 1,430 more rows
# User Organization: the business area of the user reporting the problem to the helpdesk
data %>% group_by(`organization involved`) %>% summarize(counts=n())
## # A tibble: 1 x 1
##   counts
##    <int>
## 1  65533
# Function division: The IT organization is divided into functions (mostly technology wise) 
data %>% group_by(org_role) %>% summarize(counts=n())
## # A tibble: 24 x 2
##    org_role counts
##    <chr>     <int>
##  1 A2_1       9977
##  2 A2_2       2618
##  3 A2_3       1136
##  4 A2_4       1691
##  5 A2_5        618
##  6 C_1          36
##  7 C_3           2
##  8 C_5           7
##  9 C_6         219
## 10 D_1        1488
## # … with 14 more rows
# ST (support team): the actual team that will try to solve the problem
data %>% group_by(org_group) %>% summarize(counts=n())
## # A tibble: 649 x 2
##    org_group counts
##    <chr>      <int>
##  1 A1             1
##  2 A10          146
##  3 A11           10
##  4 A12            2
##  5 A13            3
##  6 A14          106
##  7 A15            2
##  8 A16            2
##  9 A17            4
## 10 A18           35
## # … with 639 more rows
# Ticket owner (responsible for ticket during its lifecycle), works in a support team
data %>% group_by(resource_id) %>% summarize(counts=n())
## # A tibble: 1,440 x 2
##    resource_id counts
##    <fct>        <int>
##  1 -               30
##  2 Aaron           37
##  3 Abby            83
##  4 Abdelkader       1
##  5 Abdul           83
##  6 Abhijit          2
##  7 Abhimanyu        2
##  8 Abhinav         26
##  9 Abhiseka        77
## 10 Abhishek         6
## # … with 1,430 more rows
# Products serviced
data %>% group_by(product) %>% summarize(counts=n())
## # A tibble: 704 x 2
##    product counts
##    <chr>    <int>
##  1 - -          6
##  2 OTHER        6
##  3 OTHERS      49
##  4 PROD1       15
##  5 PROD102      8
##  6 PROD103     10
##  7 PROD104    317
##  8 PROD105      4
##  9 PROD106     12
## 10 PROD107     58
## # … with 694 more rows
# Incident impact classes
data %>% group_by(impact) %>% summarize(counts=n()) 
## # A tibble: 4 x 2
##   impact counts
##   <chr>   <int>
## 1 High     2707
## 2 Low     27877
## 3 Major      44
## 4 Medium  34905
# Country of Ticket Owner
data %>% group_by(`resource country`) %>% summarize(counts=n())
## # A tibble: 1 x 1
##   counts
##    <int>
## 1  65533
# Country of support team and/or function division
data %>% group_by(`organization country`) %>% summarize(counts=n())
## # A tibble: 1 x 1
##   counts
##    <int>
## 1  65533
  • Who is specialized in a certain task?
  • Who transfers work to whom?
# level options: "log", "case", "trace", "activity", "resource", "resource-activity"
data %>% resource_frequency(level = "resource-activity") %>% plot()

data %>% resource_frequency(level = "resource") %>% plot()

data %>% resource_frequency(level = "activity") %>% plot()

#"case", "resource", "resource-activity"
data %>% resource_involvement(level = "resource") %>% plot()

The EDA shows that:

  • persons who perform only one activity are rather specialized
  • when an activity is performed by only a limited set of resources, knowledge may be lost when they leave (brain drain)
  • the level of the activities is too abstract for a conclusive analysis
#data %>% resource_map("resource")
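The specialization argument can be made concrete by counting distinct activities per resource, which is essentially what the resource-activity frequency does. A conceptual sketch (Python, not bupaR code; names are invented):

```python
# Count how many distinct activities each resource performs;
# a resource with only one distinct activity is a specialization candidate.
events = [
    {"resource": "Anna", "activity": "Accepted"},
    {"resource": "Anna", "activity": "Accepted"},
    {"resource": "Bert", "activity": "Accepted"},
    {"resource": "Bert", "activity": "Queued"},
]

distinct = {}
for e in events:
    distinct.setdefault(e["resource"], set()).add(e["activity"])

specialists = sorted(r for r, acts in distinct.items() if len(acts) == 1)
print(specialists)  # ['Anna']
```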

Event Data Analysis: Control Flow Perspective

The control flow refers to the different successions of activities; each case can be seen as a sequence of activities. Each unique sequence is called a trace or process variant. The process can be analyzed in different ways by:
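How traces arise from cases can be sketched tool-independently (conceptual Python, not bupaR code; the three toy cases are invented):

```python
from collections import Counter

# Each case, ordered by time, yields an activity sequence; each unique
# sequence is one trace (process variant).
cases = {
    "t1": ["Accepted", "Accepted", "Completed"],
    "t2": ["Accepted", "Accepted", "Completed"],
    "t3": ["Queued", "Accepted", "Completed"],
}

# Count how often each unique sequence occurs
traces = Counter(tuple(seq) for seq in cases.values())
for trace, count in traces.most_common():
    print(",".join(trace), count / len(cases))
```

This absolute/relative frequency per unique sequence is exactly the shape of the `traces()` output above.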

Metrics (for specific aspects of the process)

  • Start and end activities (entry & exit points)

data %>% start_activities("activity") %>% plot()

data %>% end_activities("activity") %>% plot()

  • Distribution of case length
  • Which activities are always present in the cases (and exceptional ones)
  • Rework (repetitions and self-loops)
# Activity presence shows in what percentage of cases an activity is present
data %>% activity_presence()
## # A tibble: 4 x 3
##   activity_id absolute relative
##   <fct>          <int>    <dbl>
## 1 Accepted        7552 1.00    
## 2 Completed       7546 0.999   
## 3 Queued          4511 0.597   
## 4 Unmatched          5 0.000662
data %>% activity_presence() %>% plot()
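What activity presence measures can be sketched tool-independently (conceptual Python, not bupaR code; the toy cases are invented):

```python
# Activity presence: share of cases in which an activity occurs at least once.
cases = {
    "t1": ["Accepted", "Completed"],
    "t2": ["Accepted", "Queued", "Completed"],
    "t3": ["Accepted", "Completed"],
}

activities = {a for seq in cases.values() for a in seq}
presence = {
    a: sum(a in seq for seq in cases.values()) / len(cases)
    for a in sorted(activities)
}
print(presence)
# {'Accepted': 1.0, 'Completed': 1.0, 'Queued': 0.3333333333333333}
```

Note that an activity occurring several times in one case still counts that case only once, which is why presence differs from the activity frequencies shown earlier.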

# level options: "log", "case", "activity", "resource", "resource-activity"
# Min, max and average number of repetitions
data %>% number_of_repetitions(level = "log") 
##       min        q1    median      mean        q3       max    st_dev       iqr 
## 0.0000000 0.0000000 1.0000000 0.9310299 2.0000000 3.0000000 0.9648301 2.0000000 
## attr(,"type")
## [1] "all"
# Number of repetitions per resource
data %>% number_of_repetitions(level = "resource") %>% plot()

# Number of repetitions per activity
data %>% number_of_repetitions(level = "activity") %>% plot()

Visuals

  • Process map
# A normal process map
data %>% process_map(type = frequency())
  • Trace explorer
# Shows the most frequent traces covering e.g. 60% of the event log
data %>% trace_explorer(type = "frequent", coverage = 0.6)

# Shows the most infrequent traces covering e.g. 10% of the event log
data %>% trace_explorer(type = "infrequent", coverage = 0.1)

  • Precedence matrix (flows from one activity to another)
# Options: "absolute" or "relative" frequency, 
# "relative_antecedent" frequency, e.g. A is followed by B x% of the time.
# "relative_consequent" frequency, e.g. C is preceded by D x% of the time.
data %>% precedence_matrix(type = "absolute") %>% plot()
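The absolute precedence matrix simply counts direct-follows pairs within each case. A conceptual sketch (Python, not bupaR code; the toy cases are invented):

```python
from collections import Counter

# Count direct-follows pairs (activity A immediately followed by B)
# within each case; this is the absolute precedence matrix.
cases = {
    "t1": ["Accepted", "Queued", "Accepted", "Completed"],
    "t2": ["Accepted", "Completed"],
}

flows = Counter()
for seq in cases.values():
    for a, b in zip(seq, seq[1:]):
        flows[(a, b)] += 1

print(dict(flows))
```

The relative variants divide these counts by row or column totals, matching the antecedent/consequent options described in the comment above.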

Event Data Analysis: Performance Perspective

We will now concentrate on the time perspective. The process can be analyzed in different ways by:

Visuals

  • Performance process map
# A performance process map (shows durations)
data %>% process_map(type = performance())
# FUN can be median, min, max or mean;  units = "hours", "days"
data %>% process_map(type = performance(FUN = median, units = "hours"))
  • Dotted chart
# The dotted chart shows distributions of activities over time (x-axis: time, y_axis: cases)
data %>% dotted_chart(x = "absolute", sort = "start", units = "hours")

Metrics (for specific aspects of the process)

  • Throughput time
# throughput_time (includes active time + idle time) 
data %>% throughput_time(level = "log", units = "hours") %>% plot()

  • Processing time
# processing_time (sum of the activity durations, excludes time between activities)
data %>% processing_time(level = "activity") %>% plot()

  • Idle time
# idle_time (sum of the durations between activities)
data %>% idle_time("log", units = "days") %>% plot()
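The relationship between the three time metrics can be sketched for a single case with non-overlapping activity instances (conceptual Python, not bupaR code; the timestamps are invented):

```python
from datetime import datetime

# One case with two activity instances, each with a start and complete time.
activities = [
    (datetime(2012, 5, 1, 9, 0), datetime(2012, 5, 1, 10, 0)),   # 1 h active
    (datetime(2012, 5, 1, 12, 0), datetime(2012, 5, 1, 14, 0)),  # 2 h active
]

case_start = min(s for s, _ in activities)
case_end = max(e for _, e in activities)

# Throughput: first start to last complete (active + idle time)
throughput_h = (case_end - case_start).total_seconds() / 3600
# Processing: sum of the activity durations
processing_h = sum((e - s).total_seconds() / 3600 for s, e in activities)
# Idle: the gaps in between (assuming activities do not overlap)
idle_h = throughput_h - processing_h

print(throughput_h, processing_h, idle_h)  # 5.0 3.0 2.0
```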

Linking Perspectives

The first way to link perspectives is by making use of the granularity levels of the metrics (level = "log", "trace", "case", "activity", "resource", "resource-activity").

For example, by calculating the processing time at the level of resources, we can link the organizational and performance perspectives: processing_time(level = "resource")

By analyzing rework by resources, we can link the control-flow and organizational view: number_of_repetitions(level = “resource”)

Alternatively, we might also want to include additional data attributes in the analysis. This can be done by grouping the event log. Incorporating categorical data attributes into the calculation of a process metric can be done using the group_by function, similar to working with regular data in the tidyverse. Grouping on a variable will implicitly split up the event log according to the different values of that variable. Any process metric that gets calculated for a grouped event log will be calculated for each group individually. The results for each of the groups are then combined in one single output, which can also be visualized using the plot function.
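The group-then-compute idea itself is tool-independent and can be sketched as follows (conceptual Python, not bupaR code; the toy cases are invented, and average trace length stands in for any process metric):

```python
# Grouping splits the log by an attribute; the metric is then computed
# per group (here: average trace length per impact class).
cases = [
    {"impact": "Low",  "trace": ["Accepted", "Completed"]},
    {"impact": "Low",  "trace": ["Accepted", "Queued", "Accepted", "Completed"]},
    {"impact": "High", "trace": ["Accepted", "Completed"]},
]

groups = {}
for c in cases:
    groups.setdefault(c["impact"], []).append(len(c["trace"]))

avg_len = {impact: sum(ls) / len(ls) for impact, ls in groups.items()}
print(avg_len)  # {'Low': 3.0, 'High': 2.0}
```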

This workflow allows us to easily compare different groups of cases. Combining all these ingredients (data attributes, metrics, levels, plots) allows for a very flexible toolset to perform process analysis. Using the piping symbol, each of the different tools can be easily combined, e.g.:

data %>% group_by(impact) %>% number_of_repetitions(level = "resource") %>% plot()

data %>% number_of_repetitions(level = "activity") %>% arrange(activity_id)
## # Description: activity_metric[,3] [4 × 3]
##   activity_id absolute relative
##   <fct>          <dbl>    <dbl>
## 1 Accepted        4074   0.102 
## 2 Completed        514   0.0371
## 3 Queued          2445   0.212 
## 4 Unmatched          0   0

Because of this flexibility, we can now answer almost any process-related research question.